1.
bioRxiv ; 2023 Dec 01.
Article in English | MEDLINE | ID: mdl-38077089

ABSTRACT

Apes possess two sex chromosomes: the male-specific Y and the X shared by males and females. The Y chromosome is crucial for male reproduction, with deletions linked to infertility. The X chromosome carries genes vital for reproduction and cognition. Variation in mating patterns and brain function among great apes suggests corresponding differences in their sex chromosome structure and evolution. However, due to their highly repetitive nature and incomplete reference assemblies, ape sex chromosomes have been challenging to study. Here, using the state-of-the-art experimental and computational methods developed for the telomere-to-telomere (T2T) human genome, we produced gapless, complete assemblies of the X and Y chromosomes for five great apes (chimpanzee, bonobo, gorilla, Bornean and Sumatran orangutans) and a lesser ape, the siamang gibbon. These assemblies completely resolved ampliconic, palindromic, and satellite sequences, including the entire centromeres, allowing us to untangle the intricacies of ape sex chromosome evolution. We found that, compared to the X, ape Y chromosomes vary greatly in size and have low alignability and high levels of structural rearrangements. This divergence on the Y arises from the accumulation of lineage-specific ampliconic regions and palindromes (which are shared more broadly among species on the X) and from the abundance of transposable elements and satellites (which have a lower representation on the X). Our analysis of Y chromosome genes revealed lineage-specific expansions of multi-copy gene families and signatures of purifying selection. In summary, the Y exhibits dynamic evolution, while the X is more stable. Finally, mapping short-read sequencing data from >100 great ape individuals revealed the patterns of diversity and selection on their sex chromosomes, demonstrating the utility of these reference assemblies for studies of great ape evolution. These complete sex chromosome assemblies are expected to further inform conservation genetics of nonhuman apes, all of which are endangered species.

2.
Genome Biol ; 24(1): 221, 2023 10 05.
Article in English | MEDLINE | ID: mdl-37798733

ABSTRACT

Genomic benchmark datasets are essential to driving the field of genomics and bioinformatics. They provide a snapshot of the performances of sequencing technologies and analytical methods and highlight future challenges. However, they depend on sequencing technology, reference genome, and available benchmarking methods. Thus, creating a genomic benchmark dataset is laborious and highly challenging, often involving multiple sequencing technologies, different variant calling tools, and laborious manual curation. In this review, we discuss the available benchmark datasets and their utility. Additionally, we focus on the most recent benchmark of genes with medical relevance and challenging genomic complexity.


Subject(s)
Benchmarking, Genomics, Genomics/methods, Computational Biology/methods, Genome, High-Throughput Nucleotide Sequencing/methods
3.
Nat Commun ; 14(1): 5164, 2023 08 24.
Article in English | MEDLINE | ID: mdl-37620373

ABSTRACT

Long-read sequencing has dramatically increased our understanding of human genome variation. Here, we demonstrate that long-read technology can give new insights into the genomic architecture of individual cells. Clonally expanded CD8+ T-cells from a human donor were subjected to droplet-based multiple displacement amplification (dMDA) to generate long molecules with reduced bias. PacBio sequencing generated up to 40% genome coverage per single cell, enabling detection of single nucleotide variants (SNVs), structural variants (SVs), and tandem repeats, including in regions inaccessible to short reads. In total, 28 somatic SNVs were detected, including one case of mitochondrial heteroplasmy. In addition, 5473 high-confidence SVs per cell were discovered, a sixteen-fold increase compared to Illumina-based results from clonally related cells. Single-cell de novo assembly generated a genome size of up to 598 Mb and 1762 (12.8%) complete gene models. In summary, our work shows the promise of long-read sequencing toward characterization of the full spectrum of genetic variation in single cells.


Subject(s)
Human Genome, Genomics, Humans, Genome Size, Human Genome/genetics, CD8-Positive T-Lymphocytes, Cell Cycle
4.
Nature ; 621(7978): 344-354, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37612512

ABSTRACT

The human Y chromosome has been notoriously difficult to sequence and assemble because of its complex repeat structure that includes long palindromes, tandem repeats and segmental duplications [1-3]. As a result, more than half of the Y chromosome is missing from the GRCh38 reference sequence and it remains the last human chromosome to be finished [4,5]. Here, the Telomere-to-Telomere (T2T) consortium presents the complete 62,460,029-base-pair sequence of a human Y chromosome from the HG002 genome (T2T-Y) that corrects multiple errors in GRCh38-Y and adds over 30 million base pairs of sequence to the reference, showing the complete ampliconic structures of gene families TSPY, DAZ and RBMY; 41 additional protein-coding genes, mostly from the TSPY family; and an alternating pattern of human satellite 1 and 3 blocks in the heterochromatic Yq12 region. We have combined T2T-Y with a previous assembly of the CHM13 genome [4] and mapped available population variation, clinical variants and functional genomics data to produce a complete and comprehensive reference sequence for all 24 human chromosomes.


Subject(s)
Human Y Chromosomes, Genomics, DNA Sequence Analysis, Humans, Base Sequence, Human Y Chromosomes/genetics, Satellite DNA/genetics, Genetic Variation/genetics, Population Genetics, Genomics/methods, Genomics/standards, Heterochromatin/genetics, Multigene Family/genetics, Reference Standards, Genomic Segmental Duplications/genetics, DNA Sequence Analysis/standards, Tandem Repeat Sequences/genetics, Telomere/genetics
5.
Nat Methods ; 20(8): 1213-1221, 2023 08.
Article in English | MEDLINE | ID: mdl-37365340

ABSTRACT

Advancements in sequencing technologies and assembly methods enable the regular production of high-quality genome assemblies characterizing complex regions. However, challenges remain in efficiently interpreting variation at various scales, from smaller tandem repeats to megabase rearrangements, across many human genomes. We present a PanGenome Research Tool Kit (PGR-TK) enabling analyses of complex pangenome structural and haplotype variation at multiple scales. We apply the graph decomposition methods in PGR-TK to the class II major histocompatibility complex demonstrating the importance of the human pangenome for analyzing complicated regions. Moreover, we investigate the Y-chromosome genes, DAZ1/DAZ2/DAZ3/DAZ4, of which structural variants have been linked to male infertility, and X-chromosome genes OPN1LW and OPN1MW linked to eye disorders. We further showcase PGR-TK across 395 complex repetitive medically important genes. This highlights the power of PGR-TK to resolve complex variation in regions of the genome that were previously too complex to analyze.


Subject(s)
Human Genome, Genomics, Male, Humans, Major Histocompatibility Complex
6.
Cell Genom ; 2(5), 2022 May.
Article in English | MEDLINE | ID: mdl-36452119

ABSTRACT

Genome in a Bottle benchmarks are widely used to help validate clinical sequencing pipelines and develop variant calling and sequencing methods. Here we use accurate linked and long reads to expand benchmarks in 7 samples to include difficult-to-map regions and segmental duplications that are challenging for short reads. These benchmarks add more than 300,000 SNVs and 50,000 insertions or deletions (indels) and include 16% more exonic variants, many in challenging, clinically relevant genes not covered previously, such as PMS2. For HG002, we include 92% of the autosomal GRCh38 assembly while excluding regions problematic for benchmarking small variants, such as copy number variants, that should not have been in the previous version, which included 85% of GRCh38. The new benchmark identifies eight times more false negatives in a short-read variant call set relative to our previous benchmark. We demonstrate that this benchmark reliably identifies false positives and false negatives across technologies, enabling ongoing methods development.

7.
Nature ; 611(7936): 519-531, 2022 Nov.
Article in English | MEDLINE | ID: mdl-36261518

ABSTRACT

The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has benefitted society [1,2]. However, it still has many gaps and errors, and does not represent a biological genome as it is a blend of multiple individuals [3,4]. Recently, a high-quality telomere-to-telomere reference, CHM13, was generated with the latest long-read technologies, but it was derived from a hydatidiform mole cell line with a nearly homozygous genome [5]. To address these limitations, the Human Pangenome Reference Consortium formed with the goal of creating high-quality, cost-effective, diploid genome assemblies for a pangenome reference that represents human genetic diversity [6]. Here, in our first scientific report, we determined which combination of current genome sequencing and assembly approaches yields the most complete and accurate diploid genome assembly with minimal manual curation. Approaches that used highly accurate long reads and parent-child data with graph-based haplotype phasing during assembly outperformed those that did not. Developing a combination of the top-performing methods, we generated our first high-quality diploid reference assembly, containing only approximately four gaps per chromosome on average, with most chromosomes within ±1% of the length of CHM13. Nearly 48% of protein-coding genes have non-synonymous amino acid changes between haplotypes, and centromeric regions showed the highest diversity. Our findings serve as a foundation for assembling near-complete diploid human genomes at scale for a pangenome reference to capture global genetic variation from single nucleotides to structural rearrangements.


Subject(s)
Chromosome Mapping, Diploidy, Human Genome, Genomics, Humans, Chromosome Mapping/standards, Human Genome/genetics, Haplotypes/genetics, High-Throughput Nucleotide Sequencing/methods, High-Throughput Nucleotide Sequencing/standards, DNA Sequence Analysis/methods, DNA Sequence Analysis/standards, Reference Standards, Genomics/methods, Genomics/standards, Human Chromosomes/genetics, Genetic Variation/genetics
8.
Science ; 376(6588): 44-53, 2022 04.
Article in English | MEDLINE | ID: mdl-35357919

ABSTRACT

Since its initial release in 2000, the human reference genome has covered only the euchromatic fraction of the genome, leaving important heterochromatic regions unfinished. Addressing the remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium presents a complete 3.055 billion-base pair sequence of a human genome, T2T-CHM13, that includes gapless assemblies for all chromosomes except Y, corrects errors in the prior references, and introduces nearly 200 million base pairs of sequence containing 1956 gene predictions, 99 of which are predicted to be protein coding. The completed regions include all centromeric satellite arrays, recent segmental duplications, and the short arms of all five acrocentric chromosomes, unlocking these complex regions of the genome to variational and functional studies.


Subject(s)
Human Genome, Human Genome Project, DNA Sequence Analysis/standards, Cell Line, Bacterial Artificial Chromosomes/genetics, Human Chromosomes/genetics, Humans, Reference Values
9.
Science ; 376(6588): eabl3533, 2022 04.
Article in English | MEDLINE | ID: mdl-35357935

ABSTRACT

Compared to its predecessors, the Telomere-to-Telomere CHM13 genome adds nearly 200 million base pairs of sequence, corrects thousands of structural errors, and unlocks the most complex regions of the human genome for clinical and functional study. We show how this reference universally improves read mapping and variant calling for 3202 and 17 globally diverse samples sequenced with short and long reads, respectively. We identify hundreds of thousands of variants per sample in previously unresolved regions, showcasing the promise of the T2T-CHM13 reference for evolutionary and biomedical discovery. Simultaneously, this reference eliminates tens of thousands of spurious variants per sample, including reduction of false positives in 269 medically relevant genes by up to a factor of 12. Because of these improvements in variant discovery coupled with population and functional genomic resources, T2T-CHM13 is positioned to replace GRCh38 as the prevailing reference for human genetics.


Subject(s)
Genetic Variation, Human Genome, Genomics/standards, DNA Sequence Analysis/standards, Humans, Reference Standards
11.
Nat Biotechnol ; 40(5): 672-680, 2022 05.
Article in English | MEDLINE | ID: mdl-35132260

ABSTRACT

The repetitive nature and complexity of some medically relevant genes pose a challenge for their accurate analysis in a clinical setting. The Genome in a Bottle Consortium has provided variant benchmark sets, but these exclude nearly 400 medically relevant genes due to their repetitiveness or polymorphic complexity. Here, we characterize 273 of these 395 challenging autosomal genes using a haplotype-resolved whole-genome assembly. This curated benchmark reports over 17,000 single-nucleotide variations, 3,600 insertions and deletions and 200 structural variations each for human genome reference GRCh37 and GRCh38 across HG002. We show that false duplications in either GRCh37 or GRCh38 result in reference-specific, missed variants for short- and long-read technologies in medically relevant genes, including CBS, CRYAA and KCNE1. When masking these false duplications, variant recall can improve from 8% to 100%. Forming benchmarks from a haplotype-resolved whole-genome assembly may become a prototype for future benchmarks covering the whole genome.


Subject(s)
Human Genome, Human Genome/genetics, Haplotypes/genetics, Humans, DNA Sequence Analysis
12.
Lancet Digit Health ; 3(12): e795-e805, 2021 12.
Article in English | MEDLINE | ID: mdl-34756569

ABSTRACT

BACKGROUND: Kidney allograft failure is a common cause of end-stage renal disease. We aimed to develop a dynamic artificial intelligence approach to enhance risk stratification for kidney transplant recipients by generating continuously refined predictions of survival using updates of clinical data. METHODS: In this observational study, we used data from adult recipients of kidney transplants from 18 academic transplant centres in Europe, the USA, and South America, and a cohort of patients from six randomised controlled trials. The development cohort comprised patients from four centres in France, with all other patients included in external validation cohorts. To build deeply phenotyped cohorts of transplant recipients, the following data were collected in the development cohort: clinical, histological, immunological variables, and repeated measurements of estimated glomerular filtration rate (eGFR) and proteinuria (measured using the proteinuria to creatininuria ratio). To develop a dynamic prediction system based on these clinical assessments and repeated measurements, we used Bayesian joint models, an artificial intelligence approach. The prediction performances of the model were assessed via discrimination, through calculation of the area under the receiver operator curve (AUC), and calibration. This study is registered with ClinicalTrials.gov, NCT04258891. FINDINGS: 13 608 patients were included (3774 in the development cohort and 9834 in the external validation cohorts) and contributed 89 328 patient-years of data, and 416 510 eGFR and proteinuria measurements. Bayesian joint models showed that recipient immunological profile, allograft interstitial fibrosis and tubular atrophy, allograft inflammation, and repeated measurements of eGFR and proteinuria were independent risk factors for allograft survival. The final model showed accurate calibration and very high discrimination in the development cohort (overall dynamic AUC 0·857 [95% CI 0·847-0·866]) with a persistent improvement in AUCs for each new repeated measurement (from 0·780 [0·768-0·794] to 0·926 [0·917-0·932]; p<0·0001). The predictive performance was confirmed in the external validation cohorts from Europe (overall AUC 0·845 [0·837-0·854]), the USA (overall AUC 0·820 [0·808-0·831]), South America (overall AUC 0·868 [0·856-0·880]), and the cohort of patients from randomised controlled trials (overall AUC 0·857 [0·840-0·875]). INTERPRETATION: Because of its dynamic design, this model can be continuously updated and holds value as a bedside tool that could refine the prognostic judgements of clinicians in everyday practice, hence enhancing precision medicine in the transplant setting. FUNDING: MSD Avenir, French National Institute for Health and Medical Research, and Bettencourt Schueller Foundation.


Subject(s)
Allografts, Artificial Intelligence, Kidney Transplantation, Kidney/surgery, Biological Models, Postoperative Complications, Renal Insufficiency/diagnosis, Adult, Area Under Curve, Bayes Theorem, Female, Glomerular Filtration Rate, Humans, Male, Middle Aged, Prognosis, Proteinuria, Renal Insufficiency/surgery, Reproducibility of Results, Risk Assessment, Transplant Recipients
13.
F1000Res ; 10: 281, 2021.
Article in English | MEDLINE | ID: mdl-34322225

ABSTRACT

We describe the use of high-fidelity single molecule sequencing to assemble the genome of the psychoactive Psilocybe cubensis mushroom. The genome is 46.6 Mb, 46% GC, and in 32 contigs with an N50 of 3.3 Mb. The BUSCO completeness scores are 97.6% with 1.2% duplicates. The psilocybin synthesis cluster exists in a single 3.2 Mb contig. The dataset is available from NCBI BioProject with accessions PRJNA687911 and PRJNA700437.


Subject(s)
Agaricales, Psilocybe, Agaricales/genetics, Psilocybin
14.
Nat Commun ; 12(1): 1660, 2021 03 12.
Article in English | MEDLINE | ID: mdl-33712587

ABSTRACT

In less than nine months, the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) killed over a million people, including >25,000 in New York City (NYC) alone. The COVID-19 pandemic caused by SARS-CoV-2 highlights clinical needs to detect infection, track strain evolution, and identify biomarkers of disease course. To address these challenges, we designed a fast (30-minute) colorimetric test (LAMP) for SARS-CoV-2 infection from naso/oropharyngeal swabs and a large-scale shotgun metatranscriptomics platform (total-RNA-seq) for host, viral, and microbial profiling. We applied these methods to clinical specimens gathered from 669 patients in New York City during the first two months of the outbreak, yielding a broad molecular portrait of the emerging COVID-19 disease. We find significant enrichment of a NYC-distinctive clade of the virus (20C), as well as host responses in interferon, ACE, hematological, and olfaction pathways. In addition, we use 50,821 patient records to find that renin-angiotensin-aldosterone system inhibitors have a protective effect for severe COVID-19 outcomes, unlike similar drugs. Finally, spatial transcriptomic data from COVID-19 patient autopsy tissues reveal distinct ACE2 expression loci, with macrophage and neutrophil infiltration in the lungs. These findings can inform public health and may help develop and drive SARS-CoV-2 diagnostic, prevention, and treatment strategies.


Subject(s)
COVID-19/genetics, COVID-19/virology, SARS-CoV-2/genetics, Adult, Aged, Angiotensin Receptor Antagonists/pharmacology, Angiotensin-Converting Enzyme Inhibitors/pharmacology, Antiviral Agents/pharmacology, COVID-19/epidemiology, COVID-19 Nucleic Acid Testing, Drug Interactions, Female, Gene Expression Profiling, Viral Genome, HLA Antigens/genetics, Host Microbial Interactions/drug effects, Host Microbial Interactions/genetics, Humans, Male, Middle Aged, Molecular Diagnostic Techniques, New York City/epidemiology, Nucleic Acid Amplification Techniques, Pandemics, RNA-Seq, SARS-CoV-2/classification, SARS-CoV-2/drug effects, COVID-19 Drug Treatment
15.
Kidney Int ; 99(1): 186-197, 2021 01.
Article in English | MEDLINE | ID: mdl-32781106

ABSTRACT

Although the gold standard of monitoring kidney transplant function relies on glomerular filtration rate (GFR), little is known about GFR trajectories after transplantation, their determinants, and their association with outcomes. To evaluate these parameters, we examined kidney transplant recipients receiving care at 15 academic centers. Patients underwent prospective monitoring of estimated GFR (eGFR) measurements, with assessment of clinical, functional, histological and immunological parameters. Additional validation took place in seven randomized controlled trials that included a total of 14,132 patients with 403,497 eGFR measurements. After a median follow-up of 6.5 years, 1,688 patients developed end-stage kidney disease. Using unsupervised latent class mixed models, we identified eight distinct eGFR trajectories. Multinomial regression models identified seven significant determinants of eGFR trajectories including donor age, eGFR, proteinuria, and several significant histological features: graft scarring, graft interstitial inflammation and tubulitis, microcirculation inflammation, and circulating anti-HLA donor specific antibodies. The eGFR trajectories were associated with progression to end-stage kidney disease. These trajectories, their determinants and respective associations with end-stage kidney disease were similar across cohorts, as well as in diverse clinical scenarios, therapeutic eras and in the seven randomized controlled trials. Thus, our results provide the basis for a trajectory-based assessment of kidney transplant patients for risk stratification and monitoring.


Subject(s)
Chronic Kidney Failure, Kidney Transplantation, Glomerular Filtration Rate, Humans, Chronic Kidney Failure/diagnosis, Chronic Kidney Failure/surgery, Kidney Transplantation/adverse effects, Prospective Studies
16.
Bioinformatics ; 37(3): 413-415, 2021 04 20.
Article in English | MEDLINE | ID: mdl-32766814

ABSTRACT

SUMMARY: Ribbon is an alignment visualization tool that shows how alignments are positioned within both the reference and read contexts, giving an intuitive view that enables a better understanding of structural variants and the read evidence supporting them. Ribbon was born out of a need to curate complex structural variant calls and determine whether each was well supported by long-read evidence, and it uses the same intuitive visualization method to shed light on contig alignments from genome-to-genome comparisons. AVAILABILITY AND IMPLEMENTATION: Ribbon is freely available online at http://genomeribbon.com/ and is open-source at https://github.com/marianattestad/ribbon. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genomics, Software, Genome
17.
Nat Biotechnol ; 39(3): 309-312, 2021 03.
Article in English | MEDLINE | ID: mdl-33288905

ABSTRACT

Haplotype-resolved or phased genome assembly provides a complete picture of genomes and their complex genetic variations. However, current algorithms for phased assembly either do not generate chromosome-scale phasing or require pedigree information, which limits their application. We present a method named diploid assembly (DipAsm) that uses long, accurate reads and long-range conformation data for single individuals to generate a chromosome-scale phased assembly within 1 day. Applied to four public human genomes, PGP1, HG002, NA12878 and HG00733, DipAsm produced haplotype-resolved assemblies with minimum contig length needed to cover 50% of the known genome (NG50) up to 25 Mb and phased ~99.5% of heterozygous sites at 98-99% accuracy, outperforming other approaches in terms of both contiguity and phasing completeness. We demonstrate the importance of chromosome-scale phased assemblies for the discovery of structural variants (SVs), including thousands of new transposon insertions, and of highly polymorphic and medically important regions such as the human leukocyte antigen (HLA) and killer cell immunoglobulin-like receptor (KIR) regions. DipAsm will facilitate high-quality precision medicine and studies of individual haplotype variation and population diversity.
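The NG50 statistic quoted in this abstract (the minimum contig length needed to cover 50% of the known genome) can be made concrete with a small calculation. The following is a minimal sketch, assuming contig lengths and a known genome size are available as plain integers; the function name `ng50` and the toy numbers are illustrative assumptions and are not taken from DipAsm.

```python
from typing import Iterable, Optional


def ng50(contig_lengths: Iterable[int], genome_size: int) -> Optional[int]:
    """Length of the contig at which the running total, taken from the
    longest contig downward, first reaches half of genome_size.
    Returns None if all contigs together cover less than half."""
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        # Integer comparison avoids floating-point rounding on genome_size / 2.
        if 2 * covered >= genome_size:
            return length
    return None


# Hypothetical toy assembly; the lengths are made up for illustration.
toy_contigs = [40, 30, 20, 10]
print(ng50(toy_contigs, genome_size=100))
```

Passing the sum of the contig lengths as `genome_size` gives the ordinary N50, which is measured against the assembly itself; NG50 uses the known genome size so that assemblies of different total length stay comparable.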


Subject(s)
Human Chromosomes, Human Genome, Haplotypes, Algorithms, Heterozygote, Humans, Single Nucleotide Polymorphism
18.
Genome Biol ; 21(1): 290, 2020 12 01.
Article in English | MEDLINE | ID: mdl-33261648

ABSTRACT

BACKGROUND: One ongoing concern about CRISPR-Cas9 genome editing is that unspecific guide RNA (gRNA) binding may induce off-target mutations. However, accurate prediction of CRISPR-Cas9 off-target activity is challenging. Here, we present SMRT-OTS and Nano-OTS, two novel, amplification-free, long-read sequencing protocols for detection of gRNA-driven digestion of genomic DNA by Cas9 in vitro. RESULTS: The methods are assessed using the human cell line HEK293, re-sequenced at 18× coverage using highly accurate HiFi SMRT reads. SMRT-OTS and Nano-OTS are first applied to three different gRNAs targeting HEK293 genomic DNA, resulting in a set of 55 high-confidence gRNA cleavage sites identified by both methods. Twenty-five of these sites are not reported by off-target prediction software, either because they contain four or more single nucleotide mismatches or insertion/deletion mismatches, as compared with the human reference. Additional experiments reveal that 85% of Cas9 cleavage sites are also found by other in vitro-based methods and that on- and off-target sites are detectable in gene bodies where short reads fail to uniquely align. Even though SMRT-OTS and Nano-OTS identify several sites with previously validated off-target editing activity in cells, our own CRISPR-Cas9 editing experiments in human fibroblasts do not give rise to detectable off-target mutations at the in vitro-predicted sites. However, indel and structural variation events are enriched at the on-target sites. CONCLUSIONS: Amplification-free long-read sequencing reveals Cas9 cleavage sites in vitro that would have been difficult to predict using computational tools, including in dark genomic regions inaccessible by short-read sequencing.


Subject(s)
Base Sequence, CRISPR-Cas Systems, Computational Biology/methods, Gene Editing/methods, DNA, Genetic Variation, Genomics, HEK293 Cells, Humans, Mutation, Nanopore Sequencing, Kinetoplastida Guide RNA, DNA Sequence Analysis, Software
19.
Nat Commun ; 11(1): 4794, 2020 09 22.
Article in English | MEDLINE | ID: mdl-32963235

ABSTRACT

Most human genomes are characterized by aligning individual reads to the reference genome, but accurate long reads and linked reads now enable us to construct accurate, phased de novo assemblies. We focus on a medically important, highly variable, 5 million base-pair (bp) region where diploid assembly is particularly useful - the Major Histocompatibility Complex (MHC). Here, we develop a human genome benchmark derived from a diploid assembly for the openly-consented Genome in a Bottle sample HG002. We assemble a single contig for each haplotype, align them to the reference, call phased small and structural variants, and define a small variant benchmark for the MHC, covering 94% of the MHC and 22368 variants smaller than 50 bp, 49% more variants than a mapping-based benchmark. This benchmark reliably identifies errors in mapping-based callsets, and enables performance assessment in regions with much denser, complex variation than regions covered by previous benchmarks.


Subject(s)
Diploidy, Major Histocompatibility Complex/genetics, Benchmarking, Cell Line, Genetic Variation, Human Genome, Haplotypes, Humans
20.
Nat Commun ; 11(1): 2288, 2020 05 08.
Article in English | MEDLINE | ID: mdl-32385271

ABSTRACT

Improvements in long-read data and scaffolding technologies have enabled rapid generation of reference-quality assemblies for complex genomes. Still, an assessment of critical sequence depth and read length is important for allocating limited resources. To this end, we have generated eight assemblies for the complex genome of the maize inbred line NC358 using PacBio datasets ranging from 20× to 75× genomic depth and with N50 subread lengths of 11-21 kb. Assemblies with ≤30× depth and N50 subread length of 11 kb are highly fragmented, with even low-copy genic regions showing degradation at 20× depth. Distinct sequence-quality thresholds are observed for complete assembly of genes, transposable elements, and highly repetitive genomic features such as telomeres, heterochromatic knobs, and centromeres. In addition, we show high-quality optical maps can dramatically improve contiguity in even our most fragmented base assembly. This study provides a useful resource allocation reference to the community as long-read technologies continue to mature.


Subject(s)
High-Throughput Nucleotide Sequencing/methods, Inbreeding, Zea mays/genetics, Base Sequence, DNA Transposable Elements/genetics, Plant Genome, Nucleic Acid Repetitive Sequences/genetics